convert-sha256: one-off SHA1 → SHA256 repo conversion#66
Conversation
Entire-Checkpoint: 198bc99d4b0b
Entire-Checkpoint: dc4592acaadf
Entire-Checkpoint: fb6dc8f73ee0
Entire-Checkpoint: 80a4babf00ee
This allows to "sign" the conversion. Entire-Checkpoint: 1857d92ce8c7
golangci-lint v2.11.4 was unhappy across errcheck, exhaustive, gocritic, inamedparam, ireturn, maintidx, noctx, perfsprint, revive, and wrapcheck. Mostly mechanical: - Test helpers (initSHA1, initSHA256, mustTranslator) collapse the '_ :=' patterns into t.Fatalf-on-error wrappers. - exec.Command -> exec.CommandContext(ctx, ...); ctx threaded into runChecks and signBranchTips. - refs.ForEach return value checked instead of discarded. - if/else chain in runChecks rewritten as switch. - ireturn for openSource/normalizeAuth annotated, since both return shared transport interfaces by design. - maintidx on Run annotated; the function is a phase orchestrator, splitting it would obscure the pipeline. - exhaustive switches on plumbing.ObjectType annotated; the unhandled cases (OFSDelta/REFDelta/AnyObject/InvalidObject) can't reach a resolved storer. - Errors from io.ReadAll/MemoryObject.Reader/bufio.Flush/fmt.Fprintln/ auth.Method.Authorizer wrapped with fmt.Errorf. - Constant 'max' renamed to package-level 'previewMax' to stop shadowing the builtin in two places. - Named parameter added to interface methods that lint flagged. - Two fmt.Sprintf calls replaced with string concatenation. Entire-Checkpoint: 43bf90985cc5
|
bugbot run |
There was a problem hiding this comment.
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit 2b36d81. Configure here.
The previous substring check would pass on a commented `# objectformat = sha256` line or an `oldobjectformat = sha256` key in another section. Use go-git's config decoder to look up `extensions.objectformat` properly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 122fd96da7bf
Substring-matching "error" against the full fsck output could false-positive on benign dangling/warning lines that happened to contain that word (e.g. a branch or path with "error" in it). Scan line-by-line for "error:" / "fatal:" / "missing " / "broken link" / "bad " prefixes, which is what git itself uses to signal real problems. Entire-Checkpoint: f16d85f433d8
translate() is the single recursive entry point invoked from the tree-entry, commit-parent, tag-target, and message-ref loops. Check ctx.Err() at the top so Ctrl-C during a kernel-scale conversion returns promptly instead of running the whole DFS to completion. ctx is plumbed through newTranslator and stored on the translator (signatures of the many translateX helpers stay untouched). Entire-Checkpoint: 354e2e640545
Some HTTP v1 servers don't return HEAD in their info/refs. Previously HEAD was left at PlainInit's default refs/heads/master, which then fails the --check HEAD step on any main-default repo. pickHEAD now selects in order: server-advertised HEAD (resolved via the matched DesiredRef.TargetRef so user-supplied ref mappings are honored), refs/heads/main, refs/heads/master, then the lexicographically first branch. Tags-only conversions still leave HEAD at the PlainInit default since there's no sensible branch to point at. Entire-Checkpoint: 161329c78d89
- signBranchTips: explain the os.Stderr departure from req.Out; TTY inheritance is required for gpg/ssh-agent prompts. - Don't double-count when a commit/tag has both Signature and SignatureSHA256 (two encodings of the same signature). Relabel the warning to "signature(s) / mergetag header(s)" so the count matches what it actually represents. - writeOriginNotes now honors SOURCE_DATE_EPOCH and otherwise pins the wrapper-commit timestamp to the Unix epoch, so the notes-ref hash is reproducible across runs over identical source state. The timestamp is bookkeeping; it carries no information about the underlying SHA1 history. - writeLoose uses zlib level 1, matching git's core.looseCompression default — loose objects are short-lived before gc packs them, so write speed > size. - Clarify encodeBody's "scratch MemoryObject" comment so the unused format argument is no longer mystifying on re-read. Entire-Checkpoint: cabf3e19b5f2
The convert-sha256 docs promise that every branch and tag is always converted, since dropping any of them risks stranding cross-branch SHA1 references in commit and tag messages — the exact invariant the message-rewrite pass exists to maintain. The previous code piped req.ExcludeRefPrefixes straight into planner.BuildDesiredRefs, which applies it to branch and tag selection too, so the flag silently broke the promise. Validate at the top of Run: refuse any prefix that, under the planner's HasPrefix matching, would catch a refs/heads/* or refs/tags/* name. The check covers bare "", "refs/", partial "refs/h", whole "refs/heads/" / "refs/tags/", and any sub- namespace under either. Help text on convert-sha256 now also calls the rejection out. Entire-Checkpoint: ec9a87c6c8f1
writeRefs lands the source ref set on the target, then
writeOriginNotes and signBranchTips publish their own refs on
top. A source repo that already advertised refs/notes/sha1-origin
(under --all-refs) or any refs/tags/converted/* (always, since
tags are mandatory) would have those refs silently clobbered.
Add checkSideOutputCollision: run after planner.BuildDesiredRefs
and before any object work, refusing with an actionable message
that names the offending ref(s) and points at --no-origin-notes
/ --exclude-ref-prefix / dropping --sign as escapes.
Also tighten --check to skip *only* the side-output refs we
actually wrote: pass the {origin notes, signed tags} set into
runChecks instead of pattern-matching by prefix, so a legitimate
source ref that happened to share a namespace is not silently
hidden from the resolved/expected fraction.
Entire-Checkpoint: 250149a84f90
CI-blocking gofmt diff on the refs/notes/sha1-origin row. Entire-Checkpoint: 0ae6b701b541
The previous code assigned args[0] to SourceURL and args[1] to TargetDir unconditionally, so `--source-url <url> <dir>` left TargetDir empty (len(args) == 1, so args[1] was never read) and failed the "requires a source URL and a target directory" check. Consume positionals left-to-right against fields the user has not yet supplied via flags. The four flag/positional shapes (zero/one/two flags) now all yield the correct assignment. Extracted into resolveConvertSHA256Args for a unit test that covers every shape including the regression case. Entire-Checkpoint: 60184126410a
These are project-wide transport/auth interfaces, the same shape as the auth.Method entry that's already on the ireturn allowlist. Without the entries, mise run lint (which runs golangci-lint --fix) was racing: nolintlint stripped the //nolint:ireturn directives on openSource / normalizeAuth / newConn before ireturn flagged the functions, leaving lint red on every run. With the types allowed by policy, the directives become dead weight — drop them so the source documents intent once, in .golangci.yaml. Entire-Checkpoint: 9fa947bf886d
|
bugbot run |
1. Submodule gitlinks: drop the "vendored" carve-out. Even when the linked-to commit lives in the source store, rewriting the gitlink to SHA256 produces a tree that fsck-passes but breaks `git submodule update` forever — the .gitmodules upstream still advertises only SHA1. Refuse any submodule gitlink in discoverReachable; keep a defensive guard in translateTree. 2. --check HEAD on tag-only conversions: pickHEAD returns "" when no branches landed, so HEAD stays at the PlainInit default refs/heads/master and the HEAD check guarantees a failure after an otherwise successful run. runChecks now takes a hasBranches bool and marks HEAD as skipped when false, with a "tags-only conversion" detail. 3. Partial signed-tags list dropped on error: signBranchTips returns the tags it created before failing, but Run was assigning res.SignedTags = signed only on the success path and the cobra wrapper dropped result entirely on err. Assign res.SignedTags before the err check, and have the cobra wrapper print the partial result on error so users see which converted/* tags landed and need cleanup. 4. --keep-source-objects on error paths: the flag's whole purpose is debugging failed conversions, but cleanupTemp was only flipped at the *end* of Run, so every error before that wiped the temp store. Hoist the cleanupTemp = false / res.TempDir assignment to right after MkdirTemp, and propagate `res` through every subsequent error return so the kept path surfaces in both Result and Lines() output. Entire-Checkpoint: 451aea1c737a
5. discoverReachable now takes ctx and checks ctx.Err() at the top of every visit. Without it, Ctrl-C during the multi-minute discovery phase on kernel-scale repos was ignored — exactly the case the per-translate() check was added for. 6. writeLoose: document the durability tradeoff. No fsync, by design — convert-sha256 is a one-shot bulk operation, not incremental, processes millions of objects, and Run wipes the target on error so the only supported recovery is rerunning from clean state. 7. Target directory cleanup on error. ensureEmptyTarget refuses to write into a non-empty dir; without this fix, any failure after PlainInit left config/HEAD/refs behind and blocked a retry with no recovery hint. New deferred cleanup arms after PlainInit, disarms on success, and is suppressed by --keep-source-objects so users can inspect partial state. A --check failure also disarms cleanup since the conversion itself finished and the partial target is what the user needs to inspect. 8. hashPattern is now case-insensitive ((?i) prepended), and resolveMessageRef lowercases the input before lookup. An uppercase/mixed-case SHA1 reference in a commit or tag message is now rewritten the same as a lowercase one. 9. Check gained a Skipped bool. Skipped implies OK so callers gating on OK still treat it as non-fatal; callers needing a stricter audit signal can branch on Skipped first. Applied to the fsck-when-git-missing and HEAD-on-tags-only paths, with a "○" glyph in the progress output to distinguish from real passes. Tests cover discovery cancellation, fsck skipped, uppercase hash rewrite, and the existing tag-only HEAD check rewritten against the Skipped field. Entire-Checkpoint: 248018a9dd16
…ose, memoization
10. fsckHasError used bufio.Scanner with the default 64 KiB buffer,
so a single long line (paths in repository-scale fsck output)
would silently truncate and we'd miss whatever followed it.
Switch to bytes.Split on raw newlines.
11. discoverReachable was recursive, which on long linear chains
(kernel-scale: ~70k commits on the deepest single-parent path)
grew the goroutine stack into the tens of MiB. Iterativize
with an explicit work stack — memory now scales with the
in-flight frontier, not the longest chain. translate() stays
recursive: its edges are dynamic (tree entries + commit parents
+ tag targets + message-reference edges resolved against the
in-progress mapping), and rewriting it would risk silent
corruption of message rewrites. Documented the rationale and
the depth math.
12. translator.dst was set in newTranslator but never read — a
landmine inviting a future contributor to call go-git's
SHA1-hardcoded SetEncodedObject. Field removed; the type
assertion stays (with `_`) so a memory-backed dst is still
refused with a clear error.
13. fsckHasError's error-line matcher was prefix-only and case-
sensitive ("error:" / "fatal:"). Broaden to any line whose
first token starts with "error" or "fatal" (case-insensitive),
keep the "missing "/"broken link"/"bad " object reports for
older git. Closer to the previous substring match's coverage
without the path-substring false positive.
14. writeMappingFile dropped Close errors via a deferred Close.
On NFS or quota-bound filesystems write failures surface at
close time, not flush time, so the caller would think the
mapping landed when it hadn't. Explicit Close on the success
path, with the deferred Close kept as a best-effort net for
the failure path.
15. resolveMessageRef ran a full O(len(reachable)) scan for every
short hash prefix, hit twice per token (once by
extractMessageReferences, once by rewriteHashesInMessage).
reachable is frozen before translation starts, so the
(prefix → matchResult) mapping is stable; memoize it on the
translator. Halves the documented quadratic ceiling on
message-token resolution.
Tests cover the long-line + case-insensitive fsck path and the
resolveMessageRef cache (second call returns the cached answer
even after reachable is mutated).
Entire-Checkpoint: bc9f9981f089
|
bugbot run |
There was a problem hiding this comment.
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit 26f70ca. Configure here.
| reference cycles are cryptographically infeasible), but a trip into | ||
| the guard becomes a hard error rather than a stack overflow. | ||
|
|
||
| Loose object writing is done by hand rather than via go-git's |
There was a problem hiding this comment.
This is no longer the case on go-git/v6@v6.0.0-alpha.4:
| that walks Jira tickets, PR bodies, deploy manifests, or any other | ||
| system that holds frozen SHA1 references. | ||
|
|
||
| ### Branch-tip attestation tags (opt in via `--sign`) |
There was a problem hiding this comment.
Perhaps we could switch --sign with --sign-mode which defaults to none, but then has tips (current implementation) and later could have an all, which could sign all objects, to support workflows whereby users want to enforce all commits/tags being signed.
|
|
||
| const ( | ||
| originNotesRef = "refs/notes/sha1-origin" | ||
| attestationTagPrefix = "refs/tags/converted/" |
There was a problem hiding this comment.
Not sure I understand the premise here. Are there any concerns on signing tags in place?
Summary
New
git-sync convert-sha256subcommand. One-off conversion: fetches a pack over smart HTTP from a SHA1 source, walks every reachable object, and writes a fresh SHA256 bare repository where every tree/commit/tag reference is re-encoded under SHA256. All branches and tags are always converted (partial scope risks stranding cross-branch references in messages). Behavior, sharp edges, and operational characteristics documented indocs/convert-sha256.md.Design: two-pass topological DFS
The translation needs an authoritative "what's in scope" view before encoding starts, so abbreviated SHA1 prefixes in commit messages get the same uniqueness verdict regardless of how far translation has progressed, and so cross-branch message references (cherry-picks, reverts, etc.) can be added as dependency edges in the DFS:
tree/parentlinks, and tag targets. Produces areachable map[SHA1]ObjectTypecovering the full in-scope set. Submodule gitlinks pointing outside the repo error here, before the target bare repo is materialized — failing fast keeps half-converted state off disk.Cycles in the combined dependency graph are cryptographically infeasible (forming a cycle via message references would require knowing each hash before computing it), so the topological order is always valid. A defensive
inProgressguard turns unexpected graph shapes into a clear error rather than a stack overflow.Loose objects are written by hand —
go-git@v6.0.0-alpha.3'sobjfile.Writerhardcodes SHA1 in its hasher, so the standardSetEncodedObjectpath would place every translated object at a SHA1-derived filename even on a SHA256 store. A unit test recomputessha256of every loose object's decompressed content and compares against the filename to prevent regression.Signing (
--sign)After conversion,
--signshells out togit tag -s converted/<branch> <tip>for every converted branch. Each resulting signed annotated tag transitively attests the entire reachable history of that branch via the parent chain: tampering with any historical commit changes ancestor hashes, which changes the tip's hash, which changes the tag'sobjectline, which invalidates the signature.The signature is by the converter, not by the original authors — the originals' signatures are necessarily lost (they signed over pre-rewrite bytes). For internal mirrors and single-identity repos that's a strict improvement over unsigned-everywhere; for broad public repos it's weaker than the pre-conversion chain. SSH signing (
gpg.format = ssh) and OpenPGP both work — we delegate togit, so whatever the user's git config picks is honored.--sign-key <id>overrides the default key.Example output
--sign --checkagainstentireio/cli(283 refs, 6,950 commits, 55,289 objects):Inspecting one of the signed tags:
git verify-tag refs/tags/converted/main(against the user's allowed-signers file) then verifies the entire history.Other flags
--all-refs/--exclude-ref-prefix— extend scope to notes/pulls/etc., or subtract--write-mapping <path>— emit SHA1 → SHA256 TSV for rewriting external systems (PR descriptions, deploy manifests, etc.)--no-rewrite-messages— skip inline rewriting of hash references in commit/tag messages--no-origin-notes— skiprefs/notes/sha1-origin(per-commit reverse lookup of original SHA1)--check— post-conversion sanity steps (config, HEAD resolves, refs resolve,git fsck --full)--progress— live per-phase object counts (TTY only)--sign-key <id>— overrideuser.signingkeyfor--signFull reference in
docs/convert-sha256.md.Test plan
go test ./...passesgo test -race ./cmd/git-sync/internal/sha256convert/clean (atomic counters used so the--progressticker doesn't race against translation)sha256(content)invariant for every loose object, message rewriting (ancestor + cross-branch + ambiguous + skip), origin notes, mapping file, submodule fail-fast in discoverygit-http-backend(gated onGITSYNC_E2E_SHA256_HTTP_BACKEND=1): exercises pack fetch + discovery + translation + rewriting + notes + mapping +--checkend-to-end, assertsgit fsckcleanssh-keygen, pointsGIT_CONFIG_GLOBALat a generated gitconfig with SSH signing, runsconvert-sha256 --sign, asserts the resultingrefs/tags/converted/mainis a signed annotated tag targeting the converted branch tipspf13/cobra(37 refs, ~4,500 objects) andentireio/cli(283 refs, ~55,000 objects);git fsck --fullclean;git verify-tag refs/tags/converted/mainpasses against an SSH allowed-signers file🤖 Generated with Claude Code
Note
High Risk
Large new conversion path touches object encoding, ref policy, and optional signing; mistakes could produce corrupt or misleading repos, though submodule/side-output guards and
--checkmitigate operational risk.Overview
Adds
git-sync convert-sha256, a one-off migration that pulls a SHA1 repo over smart HTTP into a new empty SHA256 bare directory, re-encoding all reachable objects and refs (branches/tags always; optional--all-refs/--exclude-ref-prefixwith guards so heads/tags cannot be dropped).The implementation lives in
cmd/git-sync/internal/sha256convert: temp SHA1 fetch → reachability discovery (submodule gitlinks fail before the target is created) → manual SHA256 loose-object writes (workaround for go-git’s SHA1 object writer) → ref/HEAD setup, optional message hash rewrites,refs/notes/sha1-origin, TSV mapping,--signattestation tags, and--check. The Cobra layer prints partial results on failure and fixes positional vs flag arg resolution.Docs/README document usage and sharp edges (signatures stripped, submodules). Tests cover the translator, CLI arg resolution, and gated
git-http-backende2e; golangci allowsgitproto.Conn/AuthMethodreturn types.Reviewed by Cursor Bugbot for commit 26f70ca. Configure here.